Project - Unsupervised Learning - Vehicle Recognition (Ratnesh Gupta)

Data Description:


The data contains features extracted from the silhouettes of vehicles viewed at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Domain:


Object recognition

Context:


The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:


  • All the features are geometric features extracted from the silhouette.
  • All are numeric in nature.

Learning Outcomes:


  • Exploratory Data Analysis.
  • Reduce the number of dimensions in the dataset with minimal information loss.
  • Train a model using Principal Components.

Objective:


Apply a dimensionality reduction technique – PCA – and train a model using principal components instead of training it on the raw data.

Steps and tasks:


  1. Data pre-processing – perform all the necessary preprocessing so the data is ready to be fed to an unsupervised algorithm (10 marks)
  2. Understanding the attributes – find the relationships between the different attributes (independent variables) and choose carefully which attributes should be part of the analysis, and why (10 points)
  3. Split the data into train and test sets (suggestion: specify "random_state" if you are using train_test_split from sklearn) (5 marks)
  4. Train a Support Vector Machine using the train set and get the accuracy on the test set (10 marks)
  5. Perform K-fold cross validation and get the cross-validation score of the model (optional)
  6. Use PCA from scikit-learn to extract Principal Components that capture about 95% of the variance in the data (10 points)
  7. Repeat steps 3, 4 and 5, but this time use the Principal Components instead of the original data. The accuracy score should be computed on the same rows of test data used earlier (hint: set the same random_state) (20 marks)
  8. Compare the accuracy scores and cross-validation scores of the two Support Vector Machines – one trained on raw data and the other on Principal Components – and report your findings (5 points)
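Step 6 above can be sketched with scikit-learn's PCA, which accepts a fractional n_components and keeps the smallest number of components whose cumulative explained variance reaches that share. The snippet below uses synthetic data (not the vehicle dataset) purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 18))            # stand-in for the 18 numeric features

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# n_components=0.95 keeps the fewest components whose
# cumulative explained variance ratio is >= 95%
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```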

Importing the Libraries

In [1]:
import os

import warnings
warnings.filterwarnings('ignore')

#import the necessary libraries
import numpy as np
import pandas as pd

#Importing libraries for visualization

import matplotlib.pyplot as plt
import seaborn as sns

#Library for Data Pre-processing
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

#Traditional Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


#Decision Tree and other Ensemble Techniques
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier

#Library for Model Evaluation 
from sklearn import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, auc
from sklearn.metrics import roc_curve

#Other Libraries
from collections import Counter
from scipy import stats
from matplotlib.colors import ListedColormap

from sklearn.decomposition import PCA
from scipy.stats import zscore

Importing the Data into DataFrame

In [2]:
os.chdir("/home/ratnesh/Downloads/")


#load the csv file and make the data frame
vehicle_df = pd.read_csv("vehicle.csv")

#Copy of Original DataFrame
vehicle_df_copy=vehicle_df.copy()

Basic EDA

In [3]:
#display the first 5 rows of dataframe
vehicle_df.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
#Shape of data
#print("The dataframe has {} rows and {} columns".format(vehicle_df.shape[0],vehicle_df.shape[1]))
print('Total Rows = {}'.format(vehicle_df.shape[0]))
print('Total Cols = {}'.format(vehicle_df.shape[1]))
vehicle_df.shape
Total Rows = 846
Total Cols = 19
Out[4]:
(846, 19)
In [5]:
#DataType of each attributes
vehicle_df.dtypes
Out[5]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [6]:
#display the information of dataframe
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

Observation :-
As per the information above, all columns except 'class' are numeric, and some columns contain null values.
The 'class' column is our target column.

Missing Values Check

In [7]:
#display in each column how many null values are there
#vehicle_df.isnull().sum()
vehicle_df.apply(lambda x: sum(x.isnull()))
Out[7]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Observation :-
As per the above, the maximum null count is 6, found in two columns: 'radius_ratio' and 'skewness_about'. We have two options: drop the rows with null values or impute them. Dropping rows is not ideal because we would lose information, so we will impute the missing values instead.

5 Point Summary

In [8]:
#display 5 point summary of dataframe
#vehicle_df.describe().transpose()
vehicle_df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [9]:
sns.pairplot(vehicle_df,diag_kind='kde', hue='class')
plt.show()

Observation :-
As per the pair plot, many columns are correlated and several have long tails, which indicates outliers. We will confirm the strength of the correlations with a correlation matrix below and check whether outliers are present.

In [10]:
#Corelation Matrix of attributes 
vehicle_df.corr()
Out[10]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.689786 0.791707 0.691081 0.091779 0.148249 0.812770 -0.788736 0.814248 0.676143 0.764361 0.818674 0.585845 -0.250603 0.236685 0.157670 0.298528 0.365552
circularity 0.689786 1.000000 0.797180 0.625051 0.154283 0.251407 0.858265 -0.827246 0.856603 0.965729 0.806791 0.850863 0.935950 0.053080 0.144968 -0.011869 -0.106339 0.045652
distance_circularity 0.791707 0.797180 1.000000 0.771748 0.158684 0.264621 0.907949 -0.913020 0.896273 0.775149 0.865710 0.890541 0.706950 -0.227001 0.114665 0.266049 0.146027 0.333648
radius_ratio 0.691081 0.625051 0.771748 1.000000 0.665363 0.450486 0.738480 -0.792946 0.712744 0.571083 0.798294 0.725598 0.541325 -0.181520 0.049112 0.174469 0.382912 0.472339
pr.axis_aspect_ratio 0.091779 0.154283 0.158684 0.665363 1.000000 0.648861 0.103832 -0.183492 0.079566 0.127322 0.273738 0.089750 0.122454 0.152860 -0.058539 -0.032180 0.240201 0.267760
max.length_aspect_ratio 0.148249 0.251407 0.264621 0.450486 0.648861 1.000000 0.165998 -0.180053 0.161603 0.305943 0.319033 0.143745 0.189752 0.295638 0.015446 0.043491 -0.026184 0.143919
scatter_ratio 0.812770 0.858265 0.907949 0.738480 0.103832 0.165998 1.000000 -0.973504 0.992078 0.810017 0.951672 0.996328 0.800577 -0.028006 0.074376 0.213512 0.005171 0.118504
elongatedness -0.788736 -0.827246 -0.913020 -0.792946 -0.183492 -0.180053 -0.973504 1.000000 -0.950405 -0.776150 -0.938313 -0.956488 -0.766671 0.103535 -0.052243 -0.186027 -0.114846 -0.216769
pr.axis_rectangularity 0.814248 0.856603 0.896273 0.712744 0.079566 0.161603 0.992078 -0.950405 1.000000 0.813135 0.938182 0.992316 0.798522 -0.015711 0.083219 0.215200 -0.019066 0.099481
max.length_rectangularity 0.676143 0.965729 0.775149 0.571083 0.127322 0.305943 0.810017 -0.776150 0.813135 1.000000 0.746657 0.797485 0.866554 0.041283 0.136077 0.001660 -0.104437 0.076770
scaled_variance 0.764361 0.806791 0.865710 0.798294 0.273738 0.319033 0.951672 -0.938313 0.938182 0.746657 1.000000 0.949766 0.781016 0.112452 0.036165 0.196202 0.014434 0.086708
scaled_variance.1 0.818674 0.850863 0.890541 0.725598 0.089750 0.143745 0.996328 -0.956488 0.992316 0.797485 0.949766 1.000000 0.797318 -0.016642 0.077288 0.202398 0.006648 0.103839
scaled_radius_of_gyration 0.585845 0.935950 0.706950 0.541325 0.122454 0.189752 0.800577 -0.766671 0.798522 0.866554 0.781016 0.797318 1.000000 0.192245 0.166785 -0.056067 -0.225882 -0.118597
scaled_radius_of_gyration.1 -0.250603 0.053080 -0.227001 -0.181520 0.152860 0.295638 -0.028006 0.103535 -0.015711 0.041283 0.112452 -0.016642 0.192245 1.000000 -0.088736 -0.126686 -0.752437 -0.804793
skewness_about 0.236685 0.144968 0.114665 0.049112 -0.058539 0.015446 0.074376 -0.052243 0.083219 0.136077 0.036165 0.077288 0.166785 -0.088736 1.000000 -0.035154 0.115728 0.097293
skewness_about.1 0.157670 -0.011869 0.266049 0.174469 -0.032180 0.043491 0.213512 -0.186027 0.215200 0.001660 0.196202 0.202398 -0.056067 -0.126686 -0.035154 1.000000 0.077460 0.205115
skewness_about.2 0.298528 -0.106339 0.146027 0.382912 0.240201 -0.026184 0.005171 -0.114846 -0.019066 -0.104437 0.014434 0.006648 -0.225882 -0.752437 0.115728 0.077460 1.000000 0.893869
hollows_ratio 0.365552 0.045652 0.333648 0.472339 0.267760 0.143919 0.118504 -0.216769 0.099481 0.076770 0.086708 0.103839 -0.118597 -0.804793 0.097293 0.205115 0.893869 1.000000

Observation :-
Many attributes are highly correlated.

1. Data pre-processing

Missing Values Treatment

Since some columns have missing values, we have to handle them before building any model.
We have two options:
1) Drop the records with missing values.
2) Fill the missing values with the mean of the attribute.

We will fill the values, since dropping records loses information.

For filling missing values we can take one of the following approaches:

  • 1) Fill the mean of the column into the missing values manually, one column at a time.
  • 2) Use an imputer to fill the missing values.
  • 3) Predict the missing values with the help of a highly correlated attribute/column.

Note :- We could drop the records that have missing values, but we would lose some information.

The null-value records can be dropped from the dataframe as below:

vehicle_df.dropna(axis=0,inplace=True)
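Approach 2 (an imputer) can be sketched with scikit-learn's SimpleImputer. The small frame below is synthetic, with column names borrowed from this dataset purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame with a couple of missing values
df = pd.DataFrame({"radius_ratio": [104.0, np.nan, 168.0, 195.0],
                   "skewness_about": [0.0, 6.0, np.nan, 9.0]})

# strategy="mean" replaces each NaN with its column's mean
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum().sum())  # 0 missing values remain
```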

In [11]:
#Function for Null values treatment

def null_values(base_dataset):
    print("Shape of DataFrame before null treatment",base_dataset.shape)
    print("Null values count before treatment")
    print("===================================")
    print(base_dataset.isna().sum(),"\n")
    ## null value percentage     
    null_value_table=(base_dataset.isna().sum()/base_dataset.shape[0])*100
    ## columns within the null threshold are retained; beyond it they are dropped
    retained_columns=null_value_table[null_value_table<30].index
    # any column with more than 30% null values is dropped
    drop_columns=null_value_table[null_value_table>30].index
    base_dataset.drop(drop_columns,axis=1,inplace=True)
    len(base_dataset.isna().sum().index)
    #cont=base_dataset.describe().columns
    cont=[col for col in base_dataset.select_dtypes(np.number).columns ]
    cat=[i for i in base_dataset.columns if i not in base_dataset.describe().columns]
    for i in cat:
        base_dataset[i].fillna(base_dataset[i].value_counts().index[0],inplace=True)
    for i in cont:
        base_dataset[i].fillna(base_dataset[i].mean(),inplace=True)
    print("Null values counts after treatment")
    print("===================================")
    print(base_dataset.isna().sum())
    print("\nShape of DataFrame after null treatment",base_dataset.shape)
    #return base_dataset,cat,con
In [12]:
null_values(vehicle_df)
Shape of DataFrame before null treatment (846, 19)
Null values count before treatment
===================================
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64 

Null values counts after treatment
===================================
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

Shape of DataFrame after null treatment (846, 19)

Without the function, we can perform missing value treatment for each attribute that has missing values as shown below

Filling missing values with mean of the column/Attribute

vehicle_df["circularity"]=vehicle_df['circularity'].fillna(vehicle_df["circularity"].mean())

In [13]:
#display 5 point summary of new dataframe
#vehicle_df.describe().transpose()
vehicle_df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.828775 6.133943 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.110451 15.740902 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.888095 33.400979 104.0 141.00 168.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.678910 7.882119 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.901775 33.195188 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.933728 7.811559 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.582444 2.588326 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.631079 31.355195 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.494076 176.457706 184.0 318.25 364.0 586.75 1018.0
scaled_radius_of_gyration 846.0 174.709716 32.546223 109.0 149.00 174.0 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.447743 7.468450 59.0 67.00 72.0 75.00 135.0
skewness_about 846.0 6.364286 4.903148 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.602367 8.930792 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.919527 6.152166 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0

2. Understanding the attributes

Analysis of each column with the help of plots

In [14]:
#Distribution of data

vehicle_df.hist( figsize=(15,15), color='red')
plt.show()
In [15]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
    plt.subplot(5,4,i);
    sns.distplot(vehicle_df[col])
plt.show()
In [16]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
    plt.subplot(5,4,i);
    sns.boxplot(vehicle_df[col]);
plt.show()
In [17]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
    plt.subplot(5,4,i);
    sns.boxplot(vehicle_df['class'],vehicle_df[col]);
plt.show()

Observation :-

Columns with no outliers :-

  • 1) compactness
  • 2) circularity
  • 3) distance_circularity
  • 4) scatter_ratio
  • 5) elongatedness
  • 6) pr.axis_rectangularity
  • 7) max.length_rectangularity
  • 8) scaled_radius_of_gyration
  • 9) skewness_about.2
  • 10) hollows_ratio

Columns with outliers :-

  • 1) radius_ratio
  • 2) pr.axis_aspect_ratio
  • 3) max.length_aspect_ratio
  • 4) scaled_variance
  • 5) scaled_variance.1
  • 6) scaled_radius_of_gyration.1
  • 7) skewness_about
  • 8) skewness_about.1

Columns with normally distributed data :-

  • 1) compactness
  • 2) circularity

Columns with skewness in data :-

    a) Right skewed (mean > median)
    • 1) distance_circularity
    • 2) radius_ratio
    • 3) max.length_aspect_ratio
    • 4) scatter_ratio
    • 5) pr.axis_rectangularity
    • 6) max.length_rectangularity
    • 7) scaled_variance
    • 8) scaled_variance.1
    • 9) scaled_radius_of_gyration
    • 10) scaled_radius_of_gyration.1
    • 11) skewness_about
    • 12) skewness_about.1
    b) Left skewed (mean < median)
    • 1) elongatedness
    • 2) skewness_about.2
    • 3) hollows_ratio
In [18]:
#display how many are car,bus,van. 
vehicle_df['class'].value_counts()
Out[18]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [19]:
sns.countplot(vehicle_df['class'])
plt.show()

From the above we can see that cars are the most frequent class, followed by buses and then vans.

Outliers Treatment

There are two ways to treat outliers:
1) Drop the outlier records.
2) Replace the outlier values with the mean/median of the column/attribute.

Below are functions for both. We will replace outlier values with the mean of the column.

In [20]:
def outliers_transform_with_drop_record(base_dataset):
    num_features=[col for col in base_dataset.select_dtypes(np.number).columns ]
    print("Outliers in Dataset before Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
        
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        for p in x:
            if p <ltv or p>utv:
                base_dataset.drop(base_dataset[base_dataset[cols]>utv].index, axis=0, inplace=True)
                base_dataset.drop(base_dataset[base_dataset[cols]<ltv].index, axis=0, inplace=True)
    
    print("\nOutliers in Dataset after Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
In [21]:
#outliers_transform_with_drop_record(vehicle_df)
In [22]:
def outliers_transform_with_replace_mean(base_dataset):
    num_features=[col for col in base_dataset.select_dtypes(np.number).columns ]
    print("Outliers in Dataset before Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
        
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        y=[]
        for p in x:
            if p <ltv or p>utv:
                y.append(np.mean(x))
            else:
                y.append(p)
        base_dataset[cols]=y
                
    print("\nOutliers in Dataset after Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
In [23]:
outliers_transform_with_replace_mean(vehicle_df)
Outliers in Dataset before Treatment
====================================
Column  compactness 	 has  0  outliers
Column  circularity 	 has  0  outliers
Column  distance_circularity 	 has  0  outliers
Column  radius_ratio 	 has  3  outliers
Column  pr.axis_aspect_ratio 	 has  8  outliers
Column  max.length_aspect_ratio 	 has  13  outliers
Column  scatter_ratio 	 has  0  outliers
Column  elongatedness 	 has  0  outliers
Column  pr.axis_rectangularity 	 has  0  outliers
Column  max.length_rectangularity 	 has  0  outliers
Column  scaled_variance 	 has  1  outliers
Column  scaled_variance.1 	 has  2  outliers
Column  scaled_radius_of_gyration 	 has  0  outliers
Column  scaled_radius_of_gyration.1 	 has  15  outliers
Column  skewness_about 	 has  12  outliers
Column  skewness_about.1 	 has  1  outliers
Column  skewness_about.2 	 has  0  outliers
Column  hollows_ratio 	 has  0  outliers

Outliers in Dataset after Treatment
====================================
Column  compactness 	 has  0  outliers
Column  circularity 	 has  0  outliers
Column  distance_circularity 	 has  0  outliers
Column  radius_ratio 	 has  0  outliers
Column  pr.axis_aspect_ratio 	 has  0  outliers
Column  max.length_aspect_ratio 	 has  0  outliers
Column  scatter_ratio 	 has  0  outliers
Column  elongatedness 	 has  0  outliers
Column  pr.axis_rectangularity 	 has  0  outliers
Column  max.length_rectangularity 	 has  0  outliers
Column  scaled_variance 	 has  0  outliers
Column  scaled_variance.1 	 has  0  outliers
Column  scaled_radius_of_gyration 	 has  0  outliers
Column  scaled_radius_of_gyration.1 	 has  0  outliers
Column  skewness_about 	 has  0  outliers
Column  skewness_about.1 	 has  0  outliers
Column  skewness_about.2 	 has  0  outliers
Column  hollows_ratio 	 has  0  outliers

Observation:-
After replacing the outlier values with the column means, no outliers remain in the data.
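An alternative treatment (not used here) is to clip values to the IQR fences instead of replacing them with the mean, which preserves their ordering. A sketch on a toy series with an illustrative column name:

```python
import pandas as pd

s = pd.Series([47, 57, 61, 65, 138], name="pr.axis_aspect_ratio")  # toy values

# compute the usual 1.5*IQR fences
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# winsorize: values beyond the fences are pulled back to them
clipped = s.clip(lower, upper)
print(clipped.tolist())
```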

In [24]:
#now what is the shape of dataframe
print("after outliers treatment shape of dataframe:",vehicle_df.shape)
after outliers treatment shape of dataframe: (846, 19)

Observation:-
No change in the DataFrame shape after the missing value and outlier treatments.

In [25]:
sns.set(font_scale=1.2)
#find the correlation between independent variables
plt.figure(figsize=(25,13))
sns.heatmap(vehicle_df.corr(),annot=True)
plt.show()

Observation :-
Our objective is to recognize whether an object is a van, bus or car based on the input features, so we would like little or no multicollinearity between the features.

If two features are highly correlated, there is little value in keeping both; we can drop one of them.

The heatmap shows the correlations between the features.

In the correlation matrix above, many features are highly correlated. Looking carefully, scaled_variance.1 and scatter_ratio have a correlation of about 1.0 (0.996), and several other pairs exceed 0.9.

So we will consider dropping the columns whose absolute correlation is 0.9 or above.

There are 8 highly correlated columns :-

  • max.length_rectangularity
  • scaled_radius_of_gyration
  • skewness_about.2
  • scatter_ratio
  • elongatedness
  • pr.axis_rectangularity
  • scaled_variance
  • scaled_variance.1

Here we have two options:
1) drop those eight columns manually, or 2) apply PCA.

We will let PCA decide how to explain this high-dimensional data with a smaller number of components, and we will look at both approaches.
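The manual list of highly correlated columns can also be derived programmatically by scanning the upper triangle of the correlation matrix. The frame below is synthetic, with two nearly collinear columns named after features from this dataset purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
a = rng.normal(size=200)
df = pd.DataFrame({"scatter_ratio": a,
                   "scaled_variance.1": a * 2 + rng.normal(scale=0.01, size=200),
                   "skewness_about": rng.normal(size=200)})

# keep only the upper triangle so each pair is counted once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# flag any column correlated > 0.9 with an earlier column
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)  # ['scaled_variance.1']
```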

In [26]:
vehicle_df.replace({'car':0,'bus':1,'van':2},inplace=True)

SVM Classifier (Before PCA)

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        print("Train Result:\n=============")
        print(f"accuracy score: {accuracy_score(y_train, pred):.4f}\n")
        #print(f"Classification Report: \n \tPrecision: {precision_score(y_train, pred,average=None)}\n\tRecall Score: {recall_score(y_train, pred,average=None)}\n\tF1 score: {f1_score(y_train, pred,average=None)}\n")
        print(f"Confusion Matrix:\n=================\n {confusion_matrix(y_train, clf.predict(X_train))}\n")
        print("Classification Report:\n======================\n",classification_report(y_train, pred))
        
    elif train==False:
        pred = clf.predict(X_test)
        print("Test Result:\n============")        
        print(f"accuracy score: {accuracy_score(y_test, pred)}\n")
        #print(f"Classification Report: \n \tPrecision: {precision_score(y_test, pred,average=None)}\n\tRecall Score: {recall_score(y_test, pred,average=None)}\n\tF1 score: {f1_score(y_test, pred,average=None)}\n")
        print(f"Confusion Matrix:\n===============\n {confusion_matrix(y_test, pred)}\n")
        print("Classification Report:\n======================\n",classification_report(y_test, pred))
In [28]:
#now separate the dataframe into dependent and independent variables
X = vehicle_df.drop('class',axis=1)
Y = vehicle_df['class']
print("shape of X :", X.shape)
print("shape of Y :", Y.shape)
shape of X : (846, 18)
shape of Y : (846,)
In [29]:
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=5)
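One caveat worth noting: SVMs are sensitive to feature scale, and StandardScaler (imported earlier but not applied here) is commonly fitted on the train split only, so that test-set statistics do not leak into the scaler. A sketch of that pattern on stand-in arrays (not the actual split above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(5)
X_train = rng.normal(loc=100, scale=30, size=(100, 3))  # stand-in train features
X_test = rng.normal(loc=100, scale=30, size=(40, 3))    # stand-in test features

scaler = StandardScaler().fit(X_train)   # learn mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the train statistics on test
print(X_train_s.mean(axis=0).round(6))   # ~0 per column after scaling
```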

Linear Kernel SVM

In [30]:
from sklearn.svm import SVC

lsvm = SVC(kernel='linear')
lsvm.fit(X_train, y_train)

print_score(lsvm, X_train, y_train, X_test, y_test, train=True)
print_score(lsvm, X_train, y_train, X_test, y_test, train=False)


lsvm_accuracy=accuracy_score(y_test, lsvm.predict(X_test))
Train Result:
=============
accuracy score: 0.9713

Confusion Matrix:
=================
 [[287   6   3]
 [  7 146   0]
 [  1   0 142]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.97      0.97      0.97       296
           1       0.96      0.95      0.96       153
           2       0.98      0.99      0.99       143

    accuracy                           0.97       592
   macro avg       0.97      0.97      0.97       592
weighted avg       0.97      0.97      0.97       592

Test Result:
============
accuracy score: 0.937007874015748

Confusion Matrix:
===============
 [[127   5   1]
 [  3  61   1]
 [  5   1  50]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.94      0.95      0.95       133
           1       0.91      0.94      0.92        65
           2       0.96      0.89      0.93        56

    accuracy                           0.94       254
   macro avg       0.94      0.93      0.93       254
weighted avg       0.94      0.94      0.94       254

Polynomial Kernel SVM

In [31]:
from sklearn.svm import SVC

psvm = SVC(kernel='poly', degree=2, gamma='auto')
psvm.fit(X_train, y_train)

print_score(psvm, X_train, y_train, X_test, y_test, train=True)
print_score(psvm, X_train, y_train, X_test, y_test, train=False)

lsvm_accuracy=accuracy_score(y_test, psvm.predict(X_test))
Train Result:
=============
accuracy score: 1.0000

Confusion Matrix:
=================
 [[296   0   0]
 [  0 153   0]
 [  0   0 143]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       296
           1       1.00      1.00      1.00       153
           2       1.00      1.00      1.00       143

    accuracy                           1.00       592
   macro avg       1.00      1.00      1.00       592
weighted avg       1.00      1.00      1.00       592

Test Result:
============
accuracy score: 0.9448818897637795

Confusion Matrix:
===============
 [[128   1   4]
 [  2  63   0]
 [  5   2  49]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.95      0.96      0.96       133
           1       0.95      0.97      0.96        65
           2       0.92      0.88      0.90        56

    accuracy                           0.94       254
   macro avg       0.94      0.94      0.94       254
weighted avg       0.94      0.94      0.94       254

Radial Kernel SVM

In [32]:
from sklearn.svm import SVC

rsvm = SVC(kernel='rbf', gamma=1)
rsvm.fit(X_train, y_train)

print_score(rsvm, X_train, y_train, X_test, y_test, train=True)
print_score(rsvm, X_train, y_train, X_test, y_test, train=False)

rsvm_accuracy=accuracy_score(y_test, rsvm.predict(X_test))
Train Result:
=============
accuracy score: 1.0000

Confusion Matrix:
=================
 [[296   0   0]
 [  0 153   0]
 [  0   0 143]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       296
           1       1.00      1.00      1.00       153
           2       1.00      1.00      1.00       143

    accuracy                           1.00       592
   macro avg       1.00      1.00      1.00       592
weighted avg       1.00      1.00      1.00       592

Test Result:
============
accuracy score: 0.5236220472440944

Confusion Matrix:
===============
 [[133   0   0]
 [ 65   0   0]
 [ 56   0   0]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.52      1.00      0.69       133
           1       0.00      0.00      0.00        65
           2       0.00      0.00      0.00        56

    accuracy                           0.52       254
   macro avg       0.17      0.33      0.23       254
weighted avg       0.27      0.52      0.36       254

SVM on Scaled Data

In [40]:
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
X_std = sc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_std, Y, test_size=0.3, random_state=5)
In [41]:
print("=======================Linear Kernel SVM==========================")

from sklearn.svm import SVC

lsvm = SVC(kernel='linear')
lsvm.fit(X_train, y_train)

print_score(lsvm, X_train, y_train, X_test, y_test, train=True)
print_score(lsvm, X_train, y_train, X_test, y_test, train=False)

lsvm_accuracy=accuracy_score(y_test, lsvm.predict(X_test))

print("=======================Polynomial Kernel SVM==========================")
from sklearn.svm import SVC

psvm = SVC(kernel='poly', degree=2, gamma='auto')
psvm.fit(X_train, y_train)

print_score(psvm, X_train, y_train, X_test, y_test, train=True)
print_score(psvm, X_train, y_train, X_test, y_test, train=False)

psvm_accuracy=accuracy_score(y_test, psvm.predict(X_test))

print("=======================Radial Kernel SVM==========================")
from sklearn.svm import SVC

rsvm = SVC(kernel='rbf', gamma=1)
rsvm.fit(X_train, y_train)

print_score(rsvm, X_train, y_train, X_test, y_test, train=True)
print_score(rsvm, X_train, y_train, X_test, y_test, train=False)

rsvm_accuracy=accuracy_score(y_test, rsvm.predict(X_test))
=======================Linear Kernel SVM==========================
Train Result:
=============
accuracy score: 0.9257

Confusion Matrix:
=================
 [[273  13  10]
 [ 13 138   2]
 [  6   0 137]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.93      0.92      0.93       296
           1       0.91      0.90      0.91       153
           2       0.92      0.96      0.94       143

    accuracy                           0.93       592
   macro avg       0.92      0.93      0.92       592
weighted avg       0.93      0.93      0.93       592

Test Result:
============
accuracy score: 0.8818897637795275

Confusion Matrix:
===============
 [[114  10   9]
 [  6  58   1]
 [  2   2  52]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.93      0.86      0.89       133
           1       0.83      0.89      0.86        65
           2       0.84      0.93      0.88        56

    accuracy                           0.88       254
   macro avg       0.87      0.89      0.88       254
weighted avg       0.89      0.88      0.88       254

=======================Polynomial Kernel SVM==========================
Train Result:
=============
accuracy score: 0.5118

Confusion Matrix:
=================
 [[296   0   0]
 [146   7   0]
 [143   0   0]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.51      1.00      0.67       296
           1       1.00      0.05      0.09       153
           2       0.00      0.00      0.00       143

    accuracy                           0.51       592
   macro avg       0.50      0.35      0.25       592
weighted avg       0.51      0.51      0.36       592

Test Result:
============
accuracy score: 0.531496062992126

Confusion Matrix:
===============
 [[133   0   0]
 [ 63   2   0]
 [ 56   0   0]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.53      1.00      0.69       133
           1       1.00      0.03      0.06        65
           2       0.00      0.00      0.00        56

    accuracy                           0.53       254
   macro avg       0.51      0.34      0.25       254
weighted avg       0.53      0.53      0.38       254

=======================Radial Kernel SVM==========================
Train Result:
=============
accuracy score: 0.9797

Confusion Matrix:
=================
 [[289   3   4]
 [  1 151   1]
 [  3   0 140]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.99      0.98      0.98       296
           1       0.98      0.99      0.98       153
           2       0.97      0.98      0.97       143

    accuracy                           0.98       592
   macro avg       0.98      0.98      0.98       592
weighted avg       0.98      0.98      0.98       592

Test Result:
============
accuracy score: 0.9409448818897638

Confusion Matrix:
===============
 [[126   0   7]
 [  1  63   1]
 [  4   2  50]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.96      0.95      0.95       133
           1       0.97      0.97      0.97        65
           2       0.86      0.89      0.88        56

    accuracy                           0.94       254
   macro avg       0.93      0.94      0.93       254
weighted avg       0.94      0.94      0.94       254

In [42]:
result = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM Radial'], 
                       'Test Accuracy' : [lsvm_accuracy, psvm_accuracy, rsvm_accuracy],
                      })
result
Out[42]:
Model Test Accuracy
0 SVM Linear 0.881890
1 SVM Polynomial 0.531496
2 SVM Radial 0.940945

Support Vector Machine Hyperparameter tuning

In [118]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100], 
              'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001], 
              'kernel': ['rbf', 'poly', 'linear']} 

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5, iid=True)

grid.fit(X_train, y_train)

print_score(grid, X_train, y_train, X_test, y_test, train=True)
print_score(grid, X_train, y_train, X_test, y_test, train=False)
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Train Result:
=============
accuracy score: 1.0000

Confusion Matrix:
=================
 [[385   0   0]
 [  0 200   0]
 [  0   0 177]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       385
           1       1.00      1.00      1.00       200
           2       1.00      1.00      1.00       177

    accuracy                           1.00       762
   macro avg       1.00      1.00      1.00       762
weighted avg       1.00      1.00      1.00       762

Test Result:
============
accuracy score: 0.9761904761904762

Confusion Matrix:
===============
 [[43  1  0]
 [ 1 17  0]
 [ 0  0 22]]

Classification Report:
======================
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        44
           1       0.94      0.94      0.94        18
           2       1.00      1.00      1.00        22

    accuracy                           0.98        84
   macro avg       0.97      0.97      0.97        84
weighted avg       0.98      0.98      0.98        84

[Parallel(n_jobs=1)]: Done 630 out of 630 | elapsed: 11.1min finished
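For reference, the hyperparameters that won the grid search can be read from the fitted object's `best_params_` and `best_score_` attributes (not printed above). A minimal sketch on synthetic stand-in data (`X_demo`/`y_demo` are illustrative, not the vehicle features):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny illustrative grid on synthetic data; in the notebook the fitted
# object is `grid`, whose winning settings can be read the same way.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((60, 4))
y_demo = rng.integers(0, 2, size=60)

small_grid = GridSearchCV(SVC(), {'C': [0.1, 1], 'kernel': ['linear', 'rbf']}, cv=3)
small_grid.fit(X_demo, y_demo)

print(small_grid.best_params_)   # dict of the winning hyperparameters
print(small_grid.best_score_)    # mean cross-validated accuracy of that combination
```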

K-Fold Cross Validation

In [43]:
from sklearn.model_selection import KFold, cross_val_score


kfold = KFold(n_splits= 10, random_state = 1)

#instantiate the object
svc = SVC(kernel='linear') 


#now we will train the model with raw data

results = cross_val_score(estimator = svc, X = X_train, y = y_train, cv = kfold)

print(results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean()*100, results.std()*100 * 2))

kf_accuracy=results.mean()
[0.9        0.98333333 0.96610169 0.89830508 0.91525424 0.91525424
 0.93220339 0.83050847 0.91525424 0.88135593] 

Accuracy: 91.38 (+/- 8.08)

Repeated K-Fold Cross Validation

In [44]:
from sklearn.model_selection import RepeatedKFold

X = vehicle_df.drop('class',axis=1).values
y = vehicle_df['class'].values

accuracies = []
#lr = LogisticRegression(random_state = 1)
svc = SVC(kernel='linear') 

rkf = RepeatedKFold(n_splits = 10, n_repeats= 3, random_state = 1)

for train_index, test_index in rkf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    svc.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, svc.predict(X_test)))

print(np.round(accuracies, 3),"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracies)*100, np.std(accuracies)*100 * 2))

rkf_accuracy=np.mean(accuracies)
[0.988 0.929 0.953 0.918 0.941 0.953 0.917 0.905 0.964 0.952 0.906 0.976
 0.906 0.953 0.941 0.941 0.94  0.94  0.94  0.857 0.906 0.918 0.906 0.953
 0.953 0.941 0.94  0.94  0.929 0.964] 

Accuracy: 93.58 (+/- 5.14)
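The manual loop above can also be written in a single call: `cross_val_score` accepts any CV splitter, including `RepeatedKFold`. A minimal sketch on synthetic stand-in data (`X_demo`/`y_demo` are illustrative, not the vehicle features):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

# Illustrative data standing in for the vehicle features/labels
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((120, 5))
y_demo = rng.integers(0, 3, size=120)

# 10 folds repeated 3 times -> 30 accuracy scores in one call
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(SVC(kernel='linear'), X_demo, y_demo, cv=rkf)

print(len(scores))   # 30 accuracy values (10 folds x 3 repeats)
print(scores.mean())
```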
In [46]:
result = pd.DataFrame({'Model' : ['Linear SVM', 'Linear SVM K-Fold', 'Linear SVM Repeated K-Fold'], 
                       'Accuracy' : [lsvm_accuracy, kf_accuracy, rkf_accuracy],
                      })
result
Out[46]:
Model Accuracy
0 Linear SVM 0.881890
1 Linear SVM K-Fold 0.913757
2 Linear SVM Repeated K-Fold 0.935761
In [47]:
#now standardize the feature attributes with z-score
X = vehicle_df.drop('class',axis=1)
y = vehicle_df['class']

X_scaled = X.apply(zscore)

Principal Component Analysis (PCA)

Principal Component Analysis is an unsupervised statistical technique used to explain high-dimensional data with a small number of variables called principal components.

Principal components are linear combinations of the original variables in the dataset, which is how they can summarize high-dimensional data with only a few variables.

The big disadvantage is that the principal components are hard to interpret; in other words, a model built on PCA features becomes something of a black box.

The steps for PCA are:

  • 1) Compute the covariance matrix of the (standardized) data.
  • 2) From the covariance matrix, compute the eigenvectors and eigenvalues.
  • 3) Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues.
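The three steps above can be sketched directly with NumPy; `X_std` here is an illustrative standardized matrix, not the vehicle data:

```python
import numpy as np

# Illustrative standardized data (rows = samples, columns = features)
rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 5))
X_std = (X_std - X_std.mean(axis=0)) / X_std.std(axis=0)

# 1) covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# 2) eigenvalues and eigenvectors (eigh suits symmetric matrices)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 3) sort by decreasing eigenvalue and keep the top k components
order = np.argsort(eig_vals)[::-1]
k = 2
W = eig_vecs[:, order[:k]]   # projection matrix (5 x k)
X_projected = X_std @ W      # data in the new k-dimensional space
print(X_projected.shape)     # (100, 2)
```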

6.Use PCA from Scikit learn, extract Principal Components that capture 95% of the variance in the data

In [48]:
#build the covariance matrix; with 18 independent features it is an 18x18 matrix
cov_matrix = np.cov(X_scaled,rowvar=False)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)
cov_matrix shape: (18, 18)
Covariance_matrix [[ 1.00118343  0.68623251  0.79084412  0.72280981  0.19247831  0.49696111
   0.81319623 -0.78957587  0.81459888  0.67694334  0.77154429  0.811318
   0.58584865 -0.24940294  0.19846556  0.1569967   0.2988797   0.36598446]
 [ 0.68623251  1.00118343  0.79395399  0.63918872  0.20304921  0.55954841
   0.84921058 -0.82287347  0.84597164  0.96308094  0.80441647  0.83345344
   0.92798524  0.06915564  0.1372014  -0.01018906 -0.1057698   0.04537164]
 [ 0.79084412  0.79395399  1.00118343  0.79524977  0.24426851  0.66457987
   0.90547061 -0.91251368  0.89418513  0.77558624  0.87119625  0.88804391
   0.70678835 -0.23134244  0.09969208  0.26292611  0.14573497  0.3324884 ]
 [ 0.72280981  0.63918872  0.79524977  1.00118343  0.65301362  0.46700327
   0.77056309 -0.82663319  0.74482646  0.58056101  0.78808655  0.76346677
   0.55167826 -0.39076107  0.03636776  0.18003636  0.40613306  0.49226262]
 [ 0.19247831  0.20304921  0.24426851  0.65301362  1.00118343  0.15186016
   0.19381198 -0.29829236  0.16237218  0.14838164  0.20946714  0.19484658
   0.14918947 -0.32231686 -0.05644465 -0.02164685  0.40050021  0.41569951]
 [ 0.49696111  0.55954841  0.66457987  0.46700327  0.15186016  1.00118343
   0.48805486 -0.50184217  0.4858937   0.64283553  0.40389309  0.46117004
   0.39720516 -0.33732483  0.08103367  0.14092443  0.0806705   0.41115367]
 [ 0.81319623  0.84921058  0.90547061  0.77056309  0.19381198  0.48805486
   1.00118343 -0.97187169  0.99054075  0.80931225  0.96154912  0.98552465
   0.80021174  0.00932566  0.06455227  0.21271641  0.00517279  0.11858838]
 [-0.78957587 -0.82287347 -0.91251368 -0.82663319 -0.29829236 -0.50184217
  -0.97187169  1.00118343 -0.9502004  -0.77643696 -0.94905732 -0.95342507
  -0.76693543  0.08046474 -0.04659615 -0.18464081 -0.11486327 -0.21697531]
 [ 0.81459888  0.84597164  0.89418513  0.74482646  0.16237218  0.4858937
   0.99054075 -0.9502004   1.00118343  0.81240688  0.94862541  0.97941966
   0.79801083  0.02596713  0.07279648  0.21421199 -0.01901199  0.09930879]
 [ 0.67694334  0.96308094  0.77558624  0.58056101  0.14838164  0.64283553
   0.80931225 -0.77643696  0.81240688  1.00118343  0.75155333  0.79389837
   0.86744991  0.05297754  0.13110243  0.00428108 -0.10437712  0.07686047]
 [ 0.77154429  0.80441647  0.87119625  0.78808655  0.20946714  0.40389309
   0.96154912 -0.94905732  0.94862541  0.75155333  1.00118343  0.94945091
   0.78624114  0.02607731  0.02466438  0.19840056  0.01533732  0.0873645 ]
 [ 0.811318    0.83345344  0.88804391  0.76346677  0.19484658  0.46117004
   0.98552465 -0.95342507  0.97941966  0.79389837  0.94945091  1.00118343
   0.78771033  0.00859878  0.06647746  0.20586383  0.01667922  0.11882622]
 [ 0.58584865  0.92798524  0.70678835  0.55167826  0.14918947  0.39720516
   0.80021174 -0.76693543  0.79801083  0.86744991  0.78624114  0.78771033
   1.00118343  0.21577729  0.16343169 -0.05559633 -0.22513165 -0.1182971 ]
 [-0.24940294  0.06915564 -0.23134244 -0.39076107 -0.32231686 -0.33732483
   0.00932566  0.08046474  0.02596713  0.05297754  0.02607731  0.00859878
   0.21577729  1.00118343 -0.05930574 -0.12538611 -0.83728645 -0.90574717]
 [ 0.19846556  0.1372014   0.09969208  0.03636776 -0.05644465  0.08103367
   0.06455227 -0.04659615  0.07279648  0.13110243  0.02466438  0.06647746
   0.16343169 -0.05930574  1.00118343 -0.04171136  0.08749278  0.06341058]
 [ 0.1569967  -0.01018906  0.26292611  0.18003636 -0.02164685  0.14092443
   0.21271641 -0.18464081  0.21421199  0.00428108  0.19840056  0.20586383
  -0.05559633 -0.12538611 -0.04171136  1.00118343  0.07485774  0.20129124]
 [ 0.2988797  -0.1057698   0.14573497  0.40613306  0.40050021  0.0806705
   0.00517279 -0.11486327 -0.01901199 -0.10437712  0.01533732  0.01667922
  -0.22513165 -0.83728645  0.08749278  0.07485774  1.00118343  0.89389629]
 [ 0.36598446  0.04537164  0.3324884   0.49226262  0.41569951  0.41115367
   0.11858838 -0.21697531  0.09930879  0.07686047  0.0873645   0.11882622
  -0.1182971  -0.90574717  0.06341058  0.20129124  0.89389629  1.00118343]]
In [49]:
#now fit PCA with all 18 components to inspect the eigenvalues and eigenvectors
pca = PCA(n_components=18)
pca.fit(X_scaled)
Out[49]:
PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [50]:
#display explained variance ratio
pca.explained_variance_ratio_
Out[50]:
array([5.41332535e-01, 1.86172706e-01, 6.62054129e-02, 6.29206288e-02,
       4.91274649e-02, 3.70776079e-02, 1.76463327e-02, 1.25613396e-02,
       7.01683438e-03, 4.39378120e-03, 4.10273471e-03, 3.53135190e-03,
       2.17271683e-03, 1.72291754e-03, 1.54469318e-03, 1.10663694e-03,
       1.04155361e-03, 3.22751662e-04])
In [51]:
#display explained variance
pca.explained_variance_
Out[51]:
array([9.75551697e+00, 3.35507453e+00, 1.19310773e+00, 1.13391164e+00,
       8.85340870e-01, 6.68186762e-01, 3.18009887e-01, 2.26371692e-01,
       1.26452490e-01, 7.91816569e-02, 7.39366203e-02, 6.36395582e-02,
       3.91551857e-02, 3.10492168e-02, 2.78373819e-02, 1.99430383e-02,
       1.87701520e-02, 5.81640509e-03])
In [52]:
#display principal components
pca.components_
Out[52]:
array([[ 2.72419690e-01,  2.87361811e-01,  3.02294487e-01,
         2.69729207e-01,  9.77923893e-02,  1.94496082e-01,
         3.10279044e-01, -3.08862952e-01,  3.07260318e-01,
         2.78104234e-01,  2.99918806e-01,  3.06427053e-01,
         2.63226076e-01, -4.22914409e-02,  3.61461444e-02,
         5.88057705e-02,  3.77737374e-02,  8.45320455e-02],
       [-8.72404150e-02,  1.32011548e-01, -4.61279713e-02,
        -1.97855810e-01, -2.57465319e-01, -1.07717657e-01,
         7.49939645e-02, -1.29509622e-02,  8.74393517e-02,
         1.21794527e-01,  7.66546849e-02,  7.21857931e-02,
         2.10366262e-01,  5.04182948e-01, -1.63273149e-02,
        -9.28941160e-02, -5.01539937e-01, -5.07489065e-01],
       [-3.67147841e-02, -2.01757077e-01,  6.41009526e-02,
         5.36110291e-02, -6.73210975e-02, -1.48021404e-01,
         1.09718662e-01, -9.12276965e-02,  1.06445390e-01,
        -2.13772682e-01,  1.44195011e-01,  1.10087363e-01,
        -2.03683802e-01,  7.29028715e-02, -5.55309485e-01,
         6.73484971e-01, -6.16791010e-02, -4.07703018e-02],
       [ 1.40534235e-01, -3.68680219e-02,  1.07830429e-01,
        -2.54482192e-01, -6.13571672e-01,  2.72381835e-01,
         5.71343312e-03,  6.54776756e-02,  3.09522543e-02,
         4.17738461e-02, -6.40401698e-02, -1.28910649e-03,
        -8.35948072e-02, -1.16544223e-01,  4.81166601e-01,
         4.22903604e-01, -2.53727852e-02,  9.55571889e-02],
       [ 1.39609381e-01, -1.37454161e-01, -8.06857872e-02,
         1.31629086e-01,  1.21960402e-01, -6.42270643e-01,
         8.70530875e-02, -8.03750497e-02,  8.17629715e-02,
        -2.51520535e-01,  1.46470932e-01,  1.12089922e-01,
        -4.33887119e-03,  1.37840689e-01,  5.58837300e-01,
         1.21936879e-01,  1.84578664e-01, -1.10151317e-01],
       [ 2.61387350e-01, -6.61251628e-02, -1.60793722e-02,
        -1.43073556e-01, -5.75838357e-01, -2.85218514e-01,
         9.66680832e-02, -7.49421634e-02,  1.05126994e-01,
        -7.32601128e-02,  1.26006994e-01,  1.16146890e-01,
        -6.56114032e-02, -1.32684583e-01, -3.27746198e-01,
        -4.71027549e-01,  2.78995537e-01,  6.20917732e-02],
       [ 2.02925232e-01, -3.91642399e-01,  1.62269640e-01,
         1.68557117e-01,  8.54116516e-02,  3.95785882e-01,
         9.21797504e-02, -1.03856649e-01,  9.06812989e-02,
        -3.56983457e-01,  7.71214852e-02,  8.48754076e-02,
        -4.54432881e-01,  8.04780048e-02,  1.25769819e-01,
        -3.03478988e-01, -2.58098519e-01, -1.76463967e-01],
       [ 7.65375629e-01,  6.67297800e-02, -2.79295659e-01,
        -1.02165912e-01,  1.86947079e-01,  4.64712961e-02,
        -6.80752357e-02,  1.99363707e-01, -1.62557411e-02,
         2.15940695e-01, -1.90429226e-01, -3.74810803e-02,
        -1.48803312e-01,  3.16120230e-01, -1.13890874e-01,
         1.15822353e-01,  8.24076708e-02, -5.23272744e-04],
       [ 3.54612015e-01,  5.10648804e-02,  9.43818785e-02,
         2.54403699e-01, -3.91713464e-02, -1.37539341e-01,
        -1.32312002e-01,  2.89493364e-01, -9.46475436e-02,
        -1.82867465e-01,  3.33510607e-02, -1.37443587e-01,
         2.95591251e-01, -5.53170634e-01, -6.27750362e-02,
         5.10657883e-02, -3.72283357e-01, -2.74467160e-01],
       [ 1.32975175e-01, -1.80252847e-01,  1.36246443e-01,
         5.45855423e-02, -8.90367793e-02,  2.27618846e-01,
        -1.68799385e-01, -6.76740731e-02, -2.70415684e-01,
        -3.43510675e-01,  2.68231044e-01, -2.79354571e-01,
         5.22363000e-01,  3.91487017e-01, -4.88191268e-02,
         2.06833848e-02,  2.23554769e-01,  1.24293252e-01],
       [-1.17177562e-01,  5.76107511e-02, -5.87787173e-01,
         5.34280720e-01, -2.50142264e-01,  1.67797087e-01,
        -5.10293627e-02,  6.22227568e-02, -7.91727852e-02,
         1.80896272e-01,  4.01079158e-01, -8.27021772e-02,
        -1.60823340e-01,  6.03274682e-03,  4.15061205e-02,
         3.63926320e-02,  8.36086669e-02, -1.10290644e-01],
       [-1.39189677e-02, -1.43473754e-01, -5.97931834e-01,
        -1.54414846e-01,  6.26273757e-02,  2.01065062e-01,
         1.90871886e-01, -4.52006015e-02,  2.60966715e-01,
        -3.30237018e-01, -1.79569725e-01,  3.13954406e-01,
         3.93165690e-01, -1.34784004e-01, -3.98386539e-03,
         9.31204983e-03, -1.06912140e-01,  1.50927316e-01],
       [-6.84779068e-02, -5.08592456e-02,  1.44701579e-01,
         3.76466515e-01, -1.55998978e-01, -1.44432971e-01,
        -3.75706177e-02,  4.84244230e-01,  1.60133828e-01,
        -4.89370804e-04, -1.27226756e-01,  1.83527574e-01,
         2.45158455e-02,  2.93584313e-01, -1.56997563e-03,
        -8.38185013e-02, -2.73310836e-01,  5.52201272e-01],
       [-8.27580369e-02, -3.28409292e-01,  8.17549693e-02,
        -3.05208091e-01,  1.84988751e-01,  9.49634126e-02,
        -7.30196406e-02,  5.60167580e-01,  1.15120495e-01,
         2.04328298e-01,  4.64227989e-01,  2.78077324e-01,
         7.59014199e-02, -7.85834372e-02,  8.58061230e-03,
        -1.05797197e-02,  1.95289774e-01, -1.58134995e-01],
       [ 9.12525851e-02, -1.02460783e-01, -1.43138376e-01,
        -3.03398349e-01,  1.25662866e-01, -1.53978634e-01,
         3.37969861e-02, -1.85456839e-01,  3.16185320e-02,
         1.34688761e-01,  4.68347343e-01, -3.61502532e-01,
        -8.14043330e-02, -9.35977407e-02,  4.07311490e-02,
        -3.82945298e-02, -4.51312726e-01,  4.49698304e-01],
       [-5.07757125e-02, -2.88509910e-01,  3.07407859e-02,
         1.22064558e-01, -2.22105068e-02,  5.88501741e-03,
         1.37953244e-01,  5.88939399e-02,  6.40245142e-01,
         1.64123004e-01, -1.87063110e-01, -5.88805021e-01,
         1.21887071e-01,  6.93813427e-03, -1.40875918e-02,
        -3.60686135e-04,  1.49059399e-01, -1.39483044e-01],
       [ 4.71929472e-02, -6.41242030e-01, -4.44030604e-03,
         1.58023191e-01, -5.98317778e-02, -9.41350020e-02,
        -7.62879528e-02, -2.90963858e-01, -2.95315857e-01,
         4.86959769e-01, -2.14785498e-01,  1.93906035e-01,
         1.90223275e-01, -5.85564415e-02, -5.76390793e-03,
         9.38864122e-03, -1.04006400e-01, -6.57995889e-03],
       [-1.61012611e-02, -4.95871004e-02,  9.20198309e-03,
         4.11777827e-03,  7.83015016e-03,  1.98458799e-03,
         8.59188753e-01,  2.53206863e-01, -4.07378476e-01,
         2.24842778e-02, -3.73902013e-02, -1.60582624e-01,
         1.72269182e-02,  5.66112519e-03, -1.27810641e-03,
        -9.71376865e-03,  2.68461437e-02, -1.69266167e-03]])
In [53]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_)
plt.xlabel("eigen value/components")
plt.ylabel("variation explained")
plt.show()
In [54]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("eigen value/components")
plt.ylabel("cumulative variation explained")
plt.show()

From the plot above we can see that 8 dimensions explain about 95% of the variance in the data, so we will use the first 8 principal components.
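As an aside, scikit-learn can also choose the number of components for a target variance automatically: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that fraction of explained variance. A minimal sketch on synthetic standardized data (`X_std` is illustrative, not the vehicle features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative standardized data (200 samples, 12 correlated features)
rng = np.random.default_rng(42)
base = rng.standard_normal((200, 4))
X = base @ rng.standard_normal((4, 12)) + 0.1 * rng.standard_normal((200, 12))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Keep just enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_)                    # number of components chosen
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```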

In [55]:
#use first 8 principal components
pca_8c = PCA(n_components=8)
pca_8c.fit(X_scaled)
Out[55]:
PCA(copy=True, iterated_power='auto', n_components=8, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [56]:
#transform the scaled 18-dimensional data into 8 new dimensions with PCA
X_scaled_pca_8c = pca_8c.transform(X_scaled)
In [57]:
#display the shape of the PCA-transformed data
X_scaled_pca_8c.shape
Out[57]:
(846, 8)

Before using the 8 PCA dimensions that explain more than 95% of the variation in the data, we will first build a model on the raw data; then we will build a model on the PCA data and compare the two models.

3.Split the data into train and test

In [58]:
#now split the data into 80:20 ratio
rawdata_X_train,rawdata_X_test,rawdata_y_train,rawdata_y_test = train_test_split(X_scaled,Y,test_size=0.20,random_state=1)
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(X_scaled_pca_8c,Y,test_size=0.20,random_state=1)
In [59]:
print("shape of rawdata_X_train",rawdata_X_train.shape)
print("shape of rawdata_y_train",rawdata_y_train.shape)
print("shape of rawdata_X_test",rawdata_X_test.shape)
print("shape of rawdata_y_test",rawdata_y_test.shape)
print("--------------------------------------------")
print("shape of pca_X_train",pca_X_train.shape)
print("shape of pca_y_train",pca_y_train.shape)
print("shape of pca_X_test",pca_X_test.shape)
print("shape of pca_y_test",pca_y_test.shape)
shape of rawdata_X_train (676, 18)
shape of rawdata_y_train (676,)
shape of rawdata_X_test (170, 18)
shape of rawdata_y_test (170,)
--------------------------------------------
shape of pca_X_train (676, 8)
shape of pca_y_train (676,)
shape of pca_X_test (170, 8)
shape of pca_y_test (170,)

Support Vector Machine Model (SVC)

Without PCA

In [76]:
from sklearn.model_selection import KFold, cross_val_score


kfold = KFold(n_splits= 10, random_state = 1)

svc = SVC() #instantiate the object

#now we will train the model with raw data

results = cross_val_score(estimator = svc, X = rawdata_X_train, y = rawdata_y_train, cv = kfold)

print(results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean()*100, results.std()*100 * 2))

sns.boxplot(results)
plt.show()
[0.98529412 0.91176471 0.94117647 0.95588235 0.98529412 0.98529412
 0.92537313 0.95522388 0.88059701 0.97014925] 

Accuracy: 94.96 (+/- 6.68)
In [61]:
svc.fit(rawdata_X_train,rawdata_y_train)

print("Raw Data Training Accuracy :\t ", svc.score(rawdata_X_train, rawdata_y_train))

raw_train_accuracy=svc.score(rawdata_X_train, rawdata_y_train)

#Scoring the model on test_data
print("Raw Data Testing Accuracy :\t  ",  svc.score(rawdata_X_test, rawdata_y_test))

raw_test_accuracy=svc.score(rawdata_X_test, rawdata_y_test)

y_pred = svc.predict(rawdata_X_test)
Raw Data Training Accuracy :	  0.9718934911242604
Raw Data Testing Accuracy :	   0.9647058823529412
In [62]:
print(classification_report(rawdata_y_test, svc.predict(rawdata_X_test)))
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        90
           1       0.97      0.97      0.97        37
           2       0.95      0.95      0.95        43

    accuracy                           0.96       170
   macro avg       0.96      0.96      0.96       170
weighted avg       0.96      0.96      0.96       170

With PCA

In [78]:
#now fit the model on pca data with new dimension

from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits= 10, random_state = 1)

svc = SVC() #instantiate the object

#now train the model with pca data with new dimension

pca_results = cross_val_score(estimator = svc, X = pca_X_train, y = pca_y_train, cv = kfold)

print(pca_results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (pca_results.mean()*100, pca_results.std()*100 * 2))

sns.boxplot(pca_results)
plt.show()
[0.97058824 0.92647059 0.94117647 0.95588235 0.95588235 0.97058824
 0.94029851 0.94029851 0.82089552 0.97014925] 

Accuracy: 93.92 (+/- 8.40)
In [64]:
svc.fit(pca_X_train,pca_y_train)

print("PCA data Training Accuracy :\t ", svc.score(pca_X_train, pca_y_train))

pca_train_accuracy=svc.score(pca_X_train, pca_y_train)

#Scoring the model on test_data
print("PCA data Testing Accuracy :\t  ",  svc.score(pca_X_test, pca_y_test))

pca_test_accuracy=svc.score(pca_X_test, pca_y_test)
PCA data Training Accuracy :	  0.9689349112426036
PCA data Testing Accuracy :	   0.9529411764705882
In [65]:
print(classification_report(pca_y_test, svc.predict(pca_X_test)))
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        90
           1       0.95      0.95      0.95        37
           2       0.95      0.95      0.95        43

    accuracy                           0.95       170
   macro avg       0.95      0.95      0.95       170
weighted avg       0.95      0.95      0.95       170

In [66]:
result = pd.DataFrame({'TrainTest' : ['raw_train_accuracy', 'raw_test_accuracy', 'pca_train_accuracy','pca_test_accuracy'], 
                       'Accuracy' : [raw_train_accuracy,raw_test_accuracy, pca_train_accuracy, pca_test_accuracy],
                      })
result
Out[66]:
TrainTest Accuracy
0 raw_train_accuracy 0.971893
1 raw_test_accuracy 0.964706
2 pca_train_accuracy 0.968935
3 pca_test_accuracy 0.952941

Observation:
From the table above we can see that after reducing the data from 18 to 8 dimensions we still achieve about 95% test accuracy; the drop from the raw-data model is only about 1%.

Dropping the above-mentioned columns manually

In [67]:
#drop the columns
X_scaled.drop(['max.length_rectangularity','scaled_radius_of_gyration','skewness_about.2','scatter_ratio','elongatedness','pr.axis_rectangularity','scaled_variance','scaled_variance.1'],axis=1,inplace=True)
In [68]:
#display the shape of new dataframe
X_scaled.shape
Out[68]:
(846, 10)
In [69]:
dropcolumn_X_train,dropcolumn_X_test,dropcolumn_y_train,dropcolumn_y_test = train_test_split(X_scaled,Y,test_size=0.20,random_state=1)
In [70]:
print("shape of dropcolumn_X_train",dropcolumn_X_train.shape)
print("shape of dropcolumn_y_train",dropcolumn_y_train.shape)
print("shape of dropcolumn_X_test",dropcolumn_X_test.shape)
print("shape of dropcolumn_y_test",dropcolumn_y_test.shape)
shape of dropcolumn_X_train (676, 10)
shape of dropcolumn_y_train (676,)
shape of dropcolumn_X_test (170, 10)
shape of dropcolumn_y_test (170,)
In [71]:
#fit the model on dropcolumn_X_train,dropcolumn_y_train
svc.fit(dropcolumn_X_train,dropcolumn_y_train)
Out[71]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [72]:
#predict the y value
dropcolumn_y_predict = svc.predict(dropcolumn_X_test)
In [73]:
#display the accuracy score and confusion matrix
print("Accuracy score with dropcolumn data(10 dimension)",accuracy_score(dropcolumn_y_test,dropcolumn_y_predict))
print("Confusion matrix with dropcolumn data(10 dimension)\n",confusion_matrix(dropcolumn_y_test,dropcolumn_y_predict))
Accuracy score with dropcolumn data(10 dimension) 0.9235294117647059
Confusion matrix with dropcolumn data(10 dimension)
 [[86  2  2]
 [ 1 35  1]
 [ 7  0 36]]

Conclusion:
From the above we can see that PCA is doing a very good job. Accuracy with PCA is approximately 95% versus approximately 96% with the raw data, but note that the PCA model achieves this with only 8 dimensions, whereas the raw data uses all 18.

Even after dropping 10 dimensions, PCA gives about 95% accuracy, a drop of only about 1% from the raw-data model.

In [ ]: